Dynamic Lexica for Query Translation

Authors

  • Jussi Karlgren
  • Magnus Sahlgren
  • Timo Järvinen
  • Rickard Cöster
Abstract

This experiment tests a simple, scalable, and effective approach to building a domain-specific translation lexicon using distributional statistics over parallelized bilingual corpora. A bilingual lexicon is extracted from aligned Swedish-French data and used to translate CLEF topics from Swedish to French; the resulting French queries are then used to retrieve documents from the French-language CLEF collection. The results place 34 of the 50 queries on or above the median for the recall-oriented "precision at 1000 documents" score, and many of the remaining errors could be handled with string matching and cognate search. We conclude that the approach presented here is a simple and efficient component in an automatic query translation system.

1 Lexical Resources Should Be Dynamic

Multilingual information access applications, which are driven by modelling lexical correspondences between different human languages, rely heavily on lexical resources: the quality of the lexicon is the main bottleneck for quality of performance and coverage of service. While automatic text and speech translation have been the main multilingual tasks for most of the history of computational linguistics, the recent awareness within the information access field of the multilingual reality of information sources has made the availability of lexica an all the more critical system component.

Machine-readable lexica in general, and machine-readable multilingual lexica in particular, are difficult to come across. Manual approaches to lexicon construction vouch for high-quality results, but they are time- and labour-consuming to build, costly and complex to maintain, and inherently static in nature: tuning an existing lexicon to a new domain is a complex task that risks compromising existing information and corrupting its usefulness for previous application areas. As a specific case, human-readable dictionaries, even if digitized and made available to automatic processing, are not geared towards automatic processing: dictionaries originally designed for human perusal leave much information unsaid and belabor fine points that may not be of immediate use for the computational task at hand. Automatic lexicon acquisition techniques promise to provide fast, cheap and dynamic alternatives to manual approaches, but they have yet to prove their viability. In addition, they typically require sizeable computational resources.

This experiment utilises a simple and effective approach to using distributional statistics over parallelized bilingual corpora (text collections of material translated from one language to another) for automatic multilingual lexicon acquisition and query translation. The approach is efficient, fast and scalable, and it is easily adapted to new domains and to new languages. We evaluate the proposed methodology by first extracting a bilingual lexicon from aligned Swedish-French data, translating CLEF topics from Swedish to French, and then retrieving documents from the French section of the CLEF collection using the resulting French queries and a monolingual retrieval system. The results clearly demonstrate the viability of the approach.

2 Cooccurrence-Based Bilingual Lexicon Acquisition

Cooccurrence-based bilingual lexicon acquisition models typically assume something along the lines of:
"If we disregard the unassuming little grammatical words, we will, for the vast majority of sentences, find precisely one word representing any one given word in the parallel text. Counterterms do not necessarily constitute the same part of speech or even belong to the same word class; remarkably often, corresponding terms can be identified even where the syntactical structure is not isomorphic." [1]

or, alternatively formulated,

"... words that are translations of each other are more likely to appear in corresponding bitext regions than other pairs of words." [2]

These models, first implemented by Brown and colleagues [3], use aligned parallel corpora and define a translational relation between terms that are observed to occur with similar distributions in corresponding text segments. Our approach, the Random Indexing approach, in contrast with most other distributionally based algorithms for bilingual lexicon acquisition, takes the context (an utterance, a window of adjacency, or, when necessary, an entire document) as the primary unit. Rather than building a huge vector space of contexts by lexical item types, we build a vector space which is large enough to accommodate the occurrence information of tens of thousands of lexical item types in millions of contexts, yet compact enough to be tractable; constant in size in the face of ever-growing data; and designed to model association between distributionally similar lexical items without compilation or explicit dimensionality reduction.

2.1 Random Indexing for Bilingual Lexicon Acquisition

Random Indexing [4, 5] is a technique for producing context vectors for words based on cooccurrence statistics. It differs from other related vector space methods, such as Latent Semantic Indexing/Analysis [6, 7], in that it does not require an explicit dimension reduction phase in order to construct the vector space. Instead of collecting the data in a word-by-word or word-by-document cooccurrence matrix that needs to be reduced using computationally expensive matrix operations, Random Indexing incrementally collects the data in a context matrix with fixed dimensionality k, such that k ≪ D < V, where D is the size of the document collection and V is the size of the vocabulary. Because no dimension reduction of the resulting matrix is needed, Random Indexing is very efficient and scalable. Furthermore, it can be used for both word-based and document-based cooccurrences.

The Random Indexing procedure is a two-step operation (sketched in code below):

1. A unique k-dimensional index vector consisting of a small number of randomly selected non-zero elements is assigned to each context in the data.
2. Context vectors for the words are produced by scanning through the data: each time a word occurs, the context's k-dimensional index vector is added to the row for that word in the context matrix. When the entire text has been scanned, words are represented in the context matrix by k-dimensional context vectors that are effectively the sum of the words' contexts.
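The following is a minimal Python sketch of this two-step procedure, not the authors' implementation; the dimensionality k = 1000, the eight non-zero elements per index vector, and the function names are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

K = 1000          # fixed dimensionality k, chosen so that k << D < V
NON_ZERO = 8      # number of randomly placed +1/-1 entries per index vector
rng = np.random.default_rng(0)

def make_index_vector(k=K, non_zero=NON_ZERO):
    """Step 1: build a sparse random index vector with a few +1/-1 elements."""
    v = np.zeros(k)
    positions = rng.choice(k, size=non_zero, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=non_zero)
    return v

def build_context_vectors(contexts):
    """Step 2: scan the data; whenever a word occurs in a context, add that
    context's index vector to the word's row in the context matrix."""
    context_vectors = defaultdict(lambda: np.zeros(K))
    for tokens in contexts:                 # each context is a list of tokens
        index_vector = make_index_vector()  # one unique index vector per context
        for word in tokens:
            context_vectors[word] += index_vector
    return dict(context_vectors)
```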
To apply this methodology to the problem of automatic bilingual lexicon acquisition, we use aligned parallel data and define a context as an aligned text segment (in this case documents, since the data used in these experiments are aligned at document level). Each such aligned text segment is assigned a unique random index vector, which is used to accumulate context vectors for the words in both languages by the procedure described above: every time a word occurs in a particular text segment, the index vector of that text segment is added to the context vector for the word in question. The result is a bilingual vector space, which effectively constitutes a bilingual lexicon in the sense that translations (hopefully!) will occur close to each other in the vector space. Thus, to extract a translation of a given word, we simply compute the similarity (using the cosine of the angle between the vectors) between the context vector of that word and the context vectors of all words in the other language. The word in the other language whose context vector is most similar to the context vector of the given word is selected as the translation. The approach is described in more detail in [8, 9].
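Continuing the sketch above (again an illustration under assumed names and parameters, not the code used in the experiments), the bilingual case assigns one sparse index vector per aligned Swedish-French document pair, accumulates context vectors for the words of both languages in the same space, and reads off a translation as the cosine-nearest neighbour in the other language.

```python
import numpy as np
from collections import defaultdict

def build_bilingual_space(aligned_pairs, k=1000, non_zero=8, seed=0):
    """aligned_pairs: iterable of (swedish_tokens, french_tokens), one pair per
    aligned document. Returns two dicts of k-dimensional context vectors that
    live in the same random-index space."""
    rng = np.random.default_rng(seed)
    sv_vectors = defaultdict(lambda: np.zeros(k))
    fr_vectors = defaultdict(lambda: np.zeros(k))
    for sv_tokens, fr_tokens in aligned_pairs:
        # One unique sparse index vector per aligned segment (document pair).
        index_vector = np.zeros(k)
        positions = rng.choice(k, size=non_zero, replace=False)
        index_vector[positions] = rng.choice([-1.0, 1.0], size=non_zero)
        # Words of both languages accumulate the same segment vector.
        for word in sv_tokens:
            sv_vectors[word] += index_vector
        for word in fr_tokens:
            fr_vectors[word] += index_vector
    return dict(sv_vectors), dict(fr_vectors)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def translate(word, source_vectors, target_vectors):
    """Pick the target-language word whose context vector is most similar
    (by cosine) to the source word's context vector."""
    if word not in source_vectors:
        return None
    v = source_vectors[word]
    return max(target_vectors, key=lambda w: cosine(v, target_vectors[w]))

# Hypothetical usage: translate the terms of a Swedish CLEF topic into French
# before passing them to a monolingual French retrieval system.
# sv_vectors, fr_vectors = build_bilingual_space(aligned_pairs)
# french_query = [translate(t, sv_vectors, fr_vectors) for t in swedish_topic_terms]
```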


Similar articles

Lexicon and Corpora for Speech to Speech Translation (LC-STAR)

The objective of the EU-project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) is corpora collection and lexica creation for the purposes of Automatic Speech Recognition (ASR) and Text-to-speech (TTS) that are needed in speech-to-speech translation (SST). During the lifetime of the project (2002-2005) these lexica will be specified, built and validated. Large lexica co...


Large lexica for speech-to-speech translation: from specification to creation

This paper presents the corpora collection and lexica creation for the purposes of Automatic Speech Recognition (ASR) and Text-to-speech (TTS) that are needed in speech-to-speech translation (SST). These lexica will be specified, built and validated within the scope of the EU-project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) during the years 2002-2005. Large lexic...


xLiD-Lexica: Cross-lingual Linked Data Lexica

In this paper, we introduce our cross-lingual linked data lexica, called xLiD-Lexica, which are constructed by exploiting the multilingual Wikipedia and linked data resources from Linked Open Data (LOD). We provide the cross-lingual groundings of linked data resources from LOD as RDF data, which can be easily integrated into the LOD data sources. In addition, we build a SPARQL endpoint over our...


Towards producing bilingual lexica from monolingual corpora

Bilingual lexica are the basis for many cross-lingual natural language processing tasks. Recent work has shown success in learning bilingual dictionaries by taking advantage of comparable corpora and a diverse set of signals derived from monolingual corpora. In the present work, we describe an approach to automatically learn bilingual lexica by training a supervised classifier using word embed...


Creation and Validation of Large Lexica for Speech-to-Speech Translation Purposes

This paper presents specifications and requirements for the creation and validation of large lexica that are needed in Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and statistical Speech-to-Speech Translation (SST) systems. The prepared language resources are created and validated within the scope of the EU-project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Component...



Journal: C. Peters et al. (Eds.): CLEF 2004, LNCS 3491, Springer-Verlag Berlin Heidelberg, 2005

Pages: 150-155

Publication date: 2004